$$Y_g \sim \mathrm{Poi}(\lambda_g) \tag{5.12}$$
where $\lambda_g$ is the Poisson parameter, which represents the rate of change in the count of
gene $g$ in a sample.
The Poisson distribution assumes that the rate of change is equal to the mean, $\lambda_g$, which
is also equal to the variance.
$$\mu_g = \lambda_g = \mathrm{var}(Y_g) \tag{5.13}$$
The probability that $Y_g = y_g$ is
$$p(Y_g = y_g; \lambda_g) = \frac{e^{-\lambda_g} \times \lambda_g^{y_g}}{y_g!} \tag{5.14}$$
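
The Poisson probability in (5.14) and the mean-equals-variance property in (5.13) can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names, the rate of 4.0, and the truncation point `y_max` are illustrative assumptions, not part of the original text.

```python
import math

def poisson_pmf(y, lam):
    """P(Y = y; lam) = exp(-lam) * lam**y / y!  for lam > 0."""
    if y < 0:
        return 0.0
    # Work in log space so that large counts do not overflow the factorial.
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))

def poisson_moments(lam, y_max=200):
    """Mean and variance computed from the pmf, truncated at y_max (assumed large enough)."""
    probs = [poisson_pmf(y, lam) for y in range(y_max + 1)]
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return mean, var

# Illustrative rate; both moments should come out (numerically) equal to lam.
mean, var = poisson_moments(lam=4.0)
print(f"P(Y=3; lam=4) = {poisson_pmf(3, 4.0):.4f}")
print(f"mean = {mean:.4f}, variance = {var:.4f}")
```

The log-space form is a design choice for numerical stability only; it evaluates the same expression as (5.14).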
Modeling RNA-Seq count data with the Poisson distribution requires that the mean equal
the variance. A key challenge is the small number of replicates in typical RNA-Seq
experiments (two or three replicates per condition). Inferential methods that treat each
gene separately may therefore lack power, because the within-group variance estimates are
highly uncertain. This challenge can be overcome either by grouping the count data and
then calculating the variance and the mean in each group, or by pooling information across
genes on the assumption that the variances of different genes measured in the same
experiment are similar. In general, RNA-Seq count data suffer from over-dispersion, where
the variance is greater than the mean. A variety of software packages use different
techniques for modeling RNA-Seq count data, but most of them use the quasi-Poisson,
negative binomial, or quasi-negative binomial distribution, all of which accommodate
over-dispersed data.
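
A simple way to see over-dispersion is to compare each gene's sample variance with its sample mean across replicates. The sketch below assumes a hypothetical toy table of counts (made up for illustration, not real data) and a helper named `dispersion_index`; neither comes from the original text.

```python
from statistics import mean, variance

# Hypothetical toy counts, three replicates per gene (illustration only).
counts = {
    "geneA": [10, 12, 11],
    "geneB": [100, 160, 70],
    "geneC": [5, 5, 6],
}

def dispersion_index(values):
    """Sample variance divided by sample mean; close to 1 under a Poisson model."""
    m = mean(values)
    return variance(values) / m if m > 0 else float("nan")

for gene, values in counts.items():
    ratio = dispersion_index(values)
    label = "over-dispersed" if ratio > 1 else "not over-dispersed"
    print(f"{gene}: mean={mean(values):.1f} var={variance(values):.1f} "
          f"ratio={ratio:.2f} ({label})")
```

With only three replicates the ratio is itself a noisy estimate, which is the small-sample problem described above and the reason information is often shared across genes.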
The quasi-Poisson model is similar to the Poisson distribution, but the variance is
proportional to the mean of the counts rather than equal to it [31].
$$Y_g \sim \mathrm{qPoi}(\mu_g, \theta_g) \tag{5.15}$$
$$\mathrm{var}(Y_g) = \theta_g \mu_g \tag{5.16}$$
where $\theta_g$ is the dispersion parameter and $\mu_g$ is the mean count.
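
One common way to estimate the quasi-Poisson dispersion for a single gene is the Pearson statistic divided by the residual degrees of freedom. The sketch below assumes a one-factor design in which each observation's fitted mean is its condition mean; the function names and the counts are hypothetical and chosen only for illustration.

```python
def quasi_poisson_dispersion(groups):
    """Pearson estimate of the dispersion for one gene.

    groups: list of lists of counts, one inner list per condition.
    The fitted mean of each observation is assumed to be its group mean.
    """
    pearson = 0.0
    n_obs = 0
    for group in groups:
        mu = sum(group) / len(group)
        n_obs += len(group)
        if mu > 0:  # an all-zero group contributes nothing to the statistic
            pearson += sum((y - mu) ** 2 / mu for y in group)
    df = n_obs - len(groups)  # observations minus fitted group means
    if df <= 0:
        raise ValueError("need more observations than groups")
    return pearson / df

def quasi_poisson_variance(mu, theta):
    """Variance implied by the quasi-Poisson model: theta * mu."""
    return theta * mu

# Hypothetical counts for one gene in two conditions, three replicates each.
theta_hat = quasi_poisson_dispersion([[100, 160, 70], [30, 42, 48]])
print(f"estimated dispersion: {theta_hat:.3f}")
print(f"implied variance at mean 40: {quasi_poisson_variance(40.0, theta_hat):.1f}")
```

An estimate near 1 would indicate that the plain Poisson variance is adequate; values well above 1 indicate over-dispersion.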
In the negative binomial distribution, the variance is a function of the mean:
$$Y_g \sim \mathrm{NB}(\mu_g, \alpha_g) \tag{5.17}$$
$$\mathrm{var}(Y_g) = \mu_g + \alpha_g \mu_g^{P} \tag{5.18}$$
where $\alpha_g$ is the dispersion parameter and $P$ is an integer; $P = 2$ is the common
choice (the NB2, or quadratic, model).
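
The negative binomial variance function can be written directly, and a rough dispersion estimate follows from matching the sample mean and variance. The sketch below is a method-of-moments illustration under the assumption P = 2 by default; the function names and the counts are hypothetical, and this is not the estimator used by any particular software package.

```python
def nb_variance(mu, alpha, power=2):
    """Negative binomial variance: mu + alpha * mu**power (power=2 is NB2)."""
    return mu + alpha * mu ** power

def alpha_moment_estimate(counts, power=2):
    """Solve s^2 = m + alpha * m**power for alpha, clipped at 0 (Poisson limit)."""
    n = len(counts)
    m = sum(counts) / n
    s2 = sum((y - m) ** 2 for y in counts) / (n - 1)
    if m == 0:
        return 0.0
    return max(0.0, (s2 - m) / m ** power)

# Hypothetical replicate counts for one gene (illustration only).
alpha_hat = alpha_moment_estimate([100, 160, 70])
print(f"alpha estimate: {alpha_hat:.4f}")

# Compare Poisson and NB2 variances at a few illustrative means.
for mu in (10.0, 100.0, 1000.0):
    print(f"mu={mu:7.1f}  Poisson var={mu:9.1f}  NB2 var={nb_variance(mu, alpha_hat):12.1f}")
```

Because the quadratic term grows faster than the mean, the gap between the Poisson and NB2 variances widens for highly expressed genes, while setting alpha to zero recovers the Poisson case.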